Github Repo: https://github.com/tmbuza/microbiome-analysis.


Basic requirements



General Overview

  • We use the snakemake workflow management system[1,2] for:
    • Maintaining reproducibility in technical validation and regeneration of results
    • Creating scalable data analysis scaled to server, grid or cloud environment seamlessly.
    • Fostering sustainable improvement of the microbiome data analysis.
  • Reviewing existing workflows methodlogies[2,3] helps in gaining a better insights for improving microbiome data analysis.
  • We break any complex workflows into small contiguous but related chunks for simplicity reasons.
  • Each major step forms a separate executable snakemake rule and is represented in the DAG (Directed Acyclic Graph).


We envision to keep fostering on continuous integration and development of highly reproducible workflows.



Snakemake rule DAG



Screenshot of interactive snakemake report

The snakemake html report can be viewed using any compartible browser such as chrome to explore more on the workflow and the associated statistics. You will be able to close the left bar to get a better view of the dispaly.




Sample Metadata

What is metadata?

  • Metadata is a set of data that describes and provides information about other data. It is commonly defined as data about data.
  • Sample metadata described in this book refers to the description and context of the individual sample collected for a specific microbiome study.


Metadata structure

  • Metadata collected at different stages (Figure 1) are typically organized in an Excel or Google spreadsheet where:
    • The metadata table columns represent the properties of the samples.
    • The metadata table rows contain information associated with the samples.
    • Typically, the first column of sample metadata is Sample ID, which designates the key associated to individual sample
    • Sampl ID must be unique.


Embedded metadata

  • In most cases, you will find the metadata detached from the experimental data.
  • Embedded metadata integrates the experimental data especially for graphics.
  • Major microbiome analysis platforms require sample metadata, commonly referred to as mapping file when performing downstream analysis.


SRA metadata

  • Typically, after sequencing the microbiome DNA, the investigators are encouraged to deposit the sequence reads in a public repository. The Sequence Read Archive (SRA) is currently the best bioinformatics database for read information. The good thing about SRA is that it integrates data from the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).
  • This demo uses metadata associated with the bushmeat microbiome bioproject number PRJNA477349.
  • The metadata and the run accessions are available via the SRA database.

Demo: Downloading metadata associated with the bushmeat microbiome bioproject number PRJNA477349 Note that the SRA filename for metadata is automatically named SraRunTable.txt (not CSV).

Screen shot of SRA Run Selector for metadata associated with the NCBI-SRA bioproject number PRJNA477349


Filtering metadata

  • We want the sample metadata to include a few desired variables.
  • In this example we desire to include sample collection point i.e latitude and longitude.
  • We will drop row with NA in location column.
  • It is a good habit to rename, modify or replace longer column names with meaningful names.
  • We will select a few columns to create a desired metadata for downstream analyses.
  • Saving the tidy metadata file in RDS or RData format will preserve a compressed file but here we save a tidy file as CSV.
  • Script: workflow/scripts/tidy_metadata.

Explore the metadata

At this stage you will be able to select and decide what to include in the downstream analyses.



Sample collection points

Get latitudes and longitudes

  • Does the metadata contains latitudes and longitudes (lat-lon) of the collection point?
  • You may consider dropping a pin on exact location.

Draw interactive map

  • The leaflet R package can do a great job in dropping a pin on the corresponding coordinate.
  • Note that samples collected on the same coordinate will overlap.
  • You can zoom in-out to expand or minimize the map.
  • You can also mouse over the pin to see the variable label.

Static image from the interactive HTML map converted using the html2image package.

Read count distribution

Example 1: Bushmeat project

Bar chart depicting variables against read counts in bushmeat project.


Example 2: IBD project

Bar chart depicting variables against read counts in IBD project.




Analysis Software

Install R Environment

R is a free software for statistical computing, data analysis, and graphics[4]. We need to install R application on a personal computer to process the R programming language. You can download and install R using these steps:

  1. Go to https://www.r-project.org/.
  2. On the left, under Download, click on CRAN to access the mirrors. CRAN (Comprehensive R Archive Network) is mirrored on nearly 100 registered servers in nearly 50 regions world. See CRAN mirror status.
  3. https://cloud.r-project.org/ Pick a mirror that is close to your location, and automatically R will connect to that server ready to download the package files.
  4. Select a compatible platform to download precompiled binary distributions of the base system, which also come with contributed packages.


Install RStudio Environment

RStudio is a free program that integrates with R as an IDE (Integrated Development Environment) to implement most of the analytical functionalities[5]. For effective analysis, we must install R before installing RStudio. We will intensively use RStudio IDE to give us a user interface. We are interested in RStudio Desktop, which is the open-source regular desktop application. You can install it like this:

  1. Go to https://rstudio.com/products/rstudio/.
  2. Click on RStudio Desktop box to move to the open source edition.
  3. Choose your preferred license either open source or commercial.
  4. Select installer compatible to your operating system.


Screen shot of RStudio User Interface




Quality Control Tools

What tools are out there?

There several tools out there that can help in preprocessing raw read. Listed below are some of most common tools used in understanding the characteristics of the read and their quality scores. Click on the hyperlinked tool to learn more how to install it.

  • Seqkit for simple statistics[6].
  • FastQC for quality assessment of bases[7].
  • MultiQC for summarizing FastQC metrics[8].
  • BBMap platform for read trimming and decontamination[9].
  • Kneaddata in biobakery platform for integrated quality control[10].

Note that the links for each tool may be outdated. Make sure to check for latest instructions online.




Sequencing Data

Quick glimpse

Read sequencing data may be obtained from different sources. The most common ones include:

  1. Reads from sequencing platforms for research purposes.
  2. Reads downloaded from the Sequence Read Archive (SRA) using the SraToolkit functions.
  3. Reads synthesized using special simulation software such as InSilicoSeq[11,12].
  • Most insilico data is used for testing software before using real data.
  • Generating insilico data can be challenging but can provide a starting data for testing some pipelines.

InSilicoSeq comes with a set of pre-computed error models to allow the user to easily generate reads from the most popular Illumina instruments, including:
- HiSeq - MiSeq - NovaSeq


How to resize Fastq Files

  • Sometimes we want to extract a small subset to test the bioinformatics pipeline.
  • You can resize the fastq files using the seqkit sample function[13].
  • Below is a quick demo for extracting only 1% of the paired-end metagenomics sequencing data.

This example extract 1% of the reads in only two sample (SRR10245277 & SRR10245278)

mkdir -p data
for i in {77..78}
  do
    cat SRR102452$i\_R1.fastq \
    | seqkit sample -p 0.01 \
    | seqkit shuffle -o data/SRR102452$i\_R1_sub.fastq \
    | cat SRR102452$i\_R2.fastq \
    | seqkit sample -p 0.01 \
    | seqkit shuffle -o data/SRR102452$i\_R2_sub.fastq
  done



Bioinformatics Pipelines

Mothur pipeline

  • Most famous for profiling microbial composition from 16S rRNA sequencing data.
  • Mothur is an open-source software package for bioinformatics data processing.
  • It is a very famous platform comprised of over 145 tools that can be integrated to for a desired pipeline.
  • Mothur has a basic tutorial that help users get started with 16S rRNA gene analysis.
  • We can download a stable platform from here.

QIIME2 pipeline

  • Most famous for profiling microbial composition from 16S rRNA sequencing data.
  • QIIME2 is an open-source microbiome analysis platform with integrated software for quality control such as DADA2.
  • It is a very famous platform with an active community forum.
  • QIIME2 has profound tutorials that help users get started with 16S rRNA gene analysis.
  • We can install the latest version from here.

MetaPhlAn standalone pipeline

  • MetaPhlAn is an open-source pipeline for taxonomic profiling from metagenomic shotgun sequencing data.
  • MetaPhlAn tutorial provide a step-by-step guidance for taxonomic profiling from different environmental samples.

MetaPhlAn within HUMAnN pipeline

  • HUMAnN (the HMP Unified Metabolic Analysis Network) is an open-source pipeline for functional profiling from metagenomic sequencing data.
  • HUMAnN tutorial provide a step-by-step guidance for functional profiling.



Mapping Files

Mothur sample mapping file

The mapping files are required to direct the pipeline where to look for the files containing the sequencing data.

  • The format of the mapping files for mothur and QIIME2 pipelines is slightly different.
  • We will demonstrate how to prepare both.
  • Each mapping file will contain only the files that are parsed by bioinformatics analysis.


# A tibble: 2 × 3
  sample_id   forward             reverse            
  <chr>       <chr>               <chr>              
1 SRR10245277 SRR10245277_1.fastq SRR10245277_2.fastq
2 SRR10245278 SRR10245278_1.fastq SRR10245278_2.fastq


Mothur design file

  • Mothur pipeline expects the design (metadata) file to have column headers.
  • We will extract only the desired number of samples.
  • The first column header should be group.
  • For more detail see Mothur MiSeq SOP.
# A tibble: 2 × 2
  group       var1     
  <chr>       <chr>    
1 SRR10245277 Serengeti
2 SRR10245278 Serengeti




Appendix


Project directories

[ 832]  .
├── [1.0K]  LICENSE
├── [  44]  README.md
├── [6.5K]  Rplots.pdf
├── [1.3M]  bioinformatics.png
├── [  96]  config
│   └── [1.3K]  config.yaml
├── [ 128]  css
│   └── [3.4K]  styles.css
├── [ 224]  dags
│   ├── [ 27K]  dag.png
│   ├── [7.0K]  dag.svg
│   ├── [ 73K]  rulegraph.png
│   └── [ 12K]  rulegraph.svg
├── [ 192]  data
│   ├── [  87]  design.tsv
│   ├── [ 192]  metadata
│   │   ├── [ 56K]  SraRunTable.csv
│   │   ├── [8.1K]  metadata.csv
│   │   ├── [  55]  mothur_design_file.tsv
│   │   └── [ 130]  mothur_mapping_file.tsv
│   └── [ 224]  reads
│       ├── [204M]  SRR10245277_1.fastq
│       ├── [204M]  SRR10245277_2.fastq
│       ├── [281M]  SRR10245278_1.fastq
│       ├── [280M]  SRR10245278_2.fastq
│       └── [ 192]  test
│           ├── [1.9M]  SRR10245277_1_sub.fastq
│           ├── [1.9M]  SRR10245277_2_sub.fastq
│           ├── [2.6M]  SRR10245278_1_sub.fastq
│           └── [2.6M]  SRR10245278_2_sub.fastq
├── [ 640]  images
│   ├── [252K]  RStudioIDE.png
│   ├── [1.3M]  bioinformatics.png
│   ├── [ 81K]  bkgd1.png
│   ├── [1.1M]  bkgd2.png
│   ├── [ 76K]  cicd.png
│   ├── [431K]  imap-part1.png
│   ├── [8.3K]  metadata.png
│   ├── [443K]  planning.png
│   ├── [   0]  project_tree.txt
│   ├── [ 11K]  sample_gps.html
│   ├── [1.4M]  sample_gps.png
│   ├── [ 416]  sample_gps_files
│   │   ├── [  96]  Proj4Leaflet-1.0.1
│   │   │   └── [7.6K]  proj4leaflet.js
│   │   ├── [  96]  htmlwidgets-1.6.1
│   │   │   └── [ 32K]  htmlwidgets.js
│   │   ├── [  96]  jquery-1.12.4
│   │   │   └── [ 95K]  jquery.min.js
│   │   ├── [ 160]  leaflet-1.3.1
│   │   │   ├── [ 224]  images
│   │   │   ├── [ 14K]  leaflet.css
│   │   │   └── [136K]  leaflet.js
│   │   ├── [  96]  leaflet-binding-2.1.1
│   │   │   └── [ 93K]  leaflet.js
│   │   ├── [  96]  leaflet-providers-1.9.0
│   │   │   └── [ 24K]  leaflet-providers_1.9.0.js
│   │   ├── [  96]  leaflet-providers-plugin-2.1.1
│   │   │   └── [ 185]  leaflet-providers-plugin.js
│   │   ├── [  96]  leafletfix-1.0.0
│   │   │   └── [ 642]  leafletfix.css
│   │   ├── [  96]  proj4-2.6.2
│   │   │   └── [ 75K]  proj4.min.js
│   │   └── [ 128]  rstudio_leaflet-1.3.1
│   │       ├── [  96]  images
│   │       └── [1.1K]  rstudio_leaflet.css
│   ├── [  96]  smkreport
│   │   └── [122K]  screenshot.png
│   ├── [1.1M]  sra_run_selector.png
│   ├── [291K]  variable_freq.png
│   └── [ 46K]  variable_freq.svg
├── [5.7K]  index.Rmd
├── [6.1M]  index.html
├── [ 128]  library
│   ├── [ 23K]  apa.csl
│   └── [ 19K]  references.bib
├── [ 205]  microbiome-analysis.Rproj
├── [2.7M]  report.html
├── [  96]  resources
├── [ 192]  results
│   ├── [ 156]  read_size_asce.csv
│   ├── [ 149]  read_size_desc.csv
│   └── [  96]  stats1
│       └── [ 475]  seqkit_stats.txt
└── [ 288]  workflow
    ├── [5.0K]  Snakefile
    ├── [ 128]  envs
    │   └── [ 335]  environment.yaml
    ├── [  64]  notebooks
    ├── [  96]  reports
    ├── [ 224]  rules
    │   ├── [2.0K]  imap_part1.smk
    │   ├── [ 721]  render_index.smk
    │   ├── [1.2K]  resize_test_subset.smk
    │   └── [ 124]  rules_dag.smk
    ├── [  96]  schemas
    └── [ 704]  scripts
        ├── [1.6K]  bioinfo_pipelines.Rmd
        ├── [ 261]  common.R
        ├── [ 702]  explore_read_size.R
        ├── [1.8K]  get_sample_gps.R
        ├── [ 395]  mapping_files.Rmd
        ├── [3.7K]  metadata.Rmd
        ├── [ 345]  mothur_design_file.R
        ├── [ 817]  mothur_mapping_file.R
        ├── [ 558]  plot_var_freq.R
        ├── [ 982]  preprocess_tools.Rmd
        ├── [ 112]  render.R
        ├── [ 439]  resize_test_subset.sh
        ├── [ 257]  rules_dag.sh
        ├── [1.0K]  seqkit_stat_1.sh
        ├── [1.4K]  sequencing_data.Rmd
        ├── [ 120]  smk_html_report.sh
        ├── [2.2K]  software.Rmd
        ├── [1.0K]  tidy_metadata.R
        └── [  53]  tree.sh

33 directories, 83 files



Selected bioproject for metadata exploration

  • BioProject number:
    • PRJNA477349: Multispecies 16S rRNA from bushmeat samples collected from Tanzania.

    • PRJNA685168: Multi-omics response to biologic therapies in IBD - BioProject

 [1] "sample_id"   "age"         "sex"         "bmi"         "smoking"    
 [6] "steriods"    "antibiotics" "biologic"    "priortnf"    "wk14"       
[11] "wk52"        "milionbases"



Troubleshooting (in progress)

  1. CiteprocXMLError: Missing root element
    • Maybe the CSL file is empty. Some examples of citation style language are available on Github[14].
  2. Error from srapath: Error: RC(rcNS,rcNoTarg,rcValidating,rcConnection,rcNotFound)
    • This happened when downloading SRA reads from online and it seems to be related to connection issues. Not clear what causes this error but keeping retrying sometimes helps.
    • Alternatively, you can download the data using the NCBI-SraToolkits (fasterq-dump is the fastest).




References

[1]
Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., … Nahnsen, S. (2021). Sustainable data analysis with snakemake. F1000Research, 10. https://doi.org/10.12688/f1000research.29032.2
[2]
Snakemake. (2023). Snakemake. Retrieved from https://snakemake.readthedocs.io/en/stable
[3]
Close, W. L. (2020). Mothur 16S v4 analysis pipeline. Retrieved from https://github.com/wclose/mothurPipeline
[4]
R Core Team. (2021). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
[5]
RStudio Team. (2021). RStudio: Integrated development environment for r. Boston, MA: RStudio, PBC. Retrieved from http://www.rstudio.com/
[6]
Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: A cross-platform and ultrafast toolkit for FASTA/q file manipulation. PLoS ONE, 11. https://doi.org/10.1371/JOURNAL.PONE.0163962
[7]
Andrews, S. (2018). FastQC: Https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Retrieved from https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[8]
Ewels, P., Magnusson, M., Lundin, S., & Kaller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32, 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
[9]
Bushnell, B. (2016). BBMap short-read aligner, and other bioinformatics tools. Retrieved from http://sourceforge.net/projects/bbmap/
[10]
Biobakery. (2022). Biobakery kneaddata:quality control tool on metagenomic and metatranscriptomic sequencing data. Retrieved from https://github.com/biobakery/kneaddata
[11]
Gourlé, H., Karlsson-Lindsjö, O., Hayer, J., & Bongcam-Rudloff, E. (2018). SnaInSilicoSeq: A sequencing simulatorkemake. Accessed on february 09, 2023. Retrieved from https://github.com/HadrienG/InSilicoSeq
[12]
Gourlé, H., Karlsson-Lindsjö, O., Hayer, J., & Bongcam-Rudloff, E. (2018). Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics, 35(3), 521–522. https://doi.org/10.1093/bioinformatics/bty630
[13]
Shen, W. (2022). SeqKit - ultrafast FASTA/q kit. Retrieved from https://bioinf.shenwei.me/seqkit/
[14]
In-GitHub. (2023). Official repository for citation style language (CSL). Accessed on february 06, 2023. Retrieved from https://github.com/citation-style-language/styles